External sorting

External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory (usually a hard drive). External sorting typically uses a sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.

Contents

External merge sort

One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together.[1][2] For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

  1. Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
  4. Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
  5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file, and empty it. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.

Additional passes

That example shows a two-pass sort: a sort pass followed by a merge pass. Note that we had one merge pass that merged all the chunks at once, rather than in regular merge sort, where we merge two chunks at each step, and take \log n merge passes total. The reason for this is that every merge pass requires reading and writing every value in the array from and to disk once. Disk access is usually slow, and so reads and writes should be avoided as much as possible.

However, there is a trade-off with using fewer merge passes. As the number of chunks increases, the amount of data we can read from each chunk at a time during the merge process decreases. For sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks (we read 100MB / 501 ~ 200KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:

  1. Run the initial chunk-sorting pass as before.
  2. Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
  3. Run a second merge pass to merge the 20 larger sorted chunks.

Like in-memory sorts, efficient external sorts require O(n log n) time: exponential increases in data size require linear increases in the number of passes. If one makes liberal use of the gigabytes of RAM provided by modern computers, the logarithmic factor grows very slowly: under reasonable assumptions, one could sort at least 500 GB of data using 1 GB of main memory before a third pass became advantageous, and could sort many times that before a fourth pass became useful.[3]

Tuning performance

The Sort Benchmark, created by computer scientist Jim Gray, compares external sorting algorithms implemented using finely tuned hardware and software. Winning implementations use several techniques:

Other algorithms

External merge sort is not the only external sorting algorithm; there are also distribution sorts, which work by partitioning the unsorted values into smaller "buckets" that can be sorted in main memory. Like merge sort, external distribution sort also has a main-memory sibling; see bucket sort. There is a duality, or fundamental similarity, between merge- and distribution-based algorithms that can aid in thinking about sorting and other external memory algorithms.[6] There are in-place algorithms for external sort, which require no more disk space than the original data.

External links

References

  1. ^ Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp.248–379.
  2. ^ * Ellis Horowitz and Sartaj Sahni, Fundamentals of Data Structures, H. Freeman & Co., ISBN 0-7167-8042-9.
  3. ^ Assume a single disk with 200 MB/s transfer, 20 ms seek time, 1 GB of buffers, 500 GB to sort. The merging phase will have 500 buffers of 2M each, need to do 250K seeks and read then write 500 GB. It will spend 5,000 sec seeking and 5,000 sec transferring. Doing two passes as described above would nearly eliminate the seek time but add an additional 5,000 sec reading and writing, so this is approximately the break-even point between a two-pass and three-pass sort.
  4. ^ Nikolas Askitis, OzSort 2.0: Sorting up to 252GB for a Penny
  5. ^ Rasmussen et al., TritonSort
  6. ^ J. S. Vitter, Algorithms and Data Structures for External Memory, Series on Foundations and Trends in Theoretical Computer Science, now Publishers, Hanover, MA, 2008, ISBN 978-1-60198-106-6.